Full Name:
Online Assignments: There is a new online assignment on DataCamp. This online exercise relates to data visualization in R, e.g., using the ggplot2 package. These exercises are useful preparation for the computer lab. The online assignments on DataCamp are not mandatory.
Your task is to answer the questions in this R-markdown file. Submit both your R-markdown (.Rmd) file and the HTML file on Canvas.
Note: This week's exercise is worth 100 points. In addition, the Bonus part is worth 30 extra points.
Customer Churn is a topic that matters to organizations of all sizes. Customer churn occurs when customers stop doing business with a company, also known as customer attrition. Churn (loss of customers to competition) is a major problem for telecom companies because it is well known that it is more expensive to acquire a new customer than to keep an existing customer. Here, we use Exploratory Data Analysis to explore the churn dataset. Basically, we want to visualize and identify which factors contribute to customer churn.
Dataset: The churn data set is available in the R package liver. The data set contains 5000 rows (customers) and 20 columns (features). The last column, churn, is the target variable, which indicates whether customers churned (left the company) or not. If you want to know more about the dataset, type ?churn in your R console. You can also find more information about this dataset here.
Here we need to load the following R packages; among them, we use the ggcorr() function from the GGally package and the pairs.panels() function from the psych package. NOTE: If you have not installed those two packages, you should first install them.
To load the packages:
library( ggplot2 )
library( liver )
library( GGally )
library( psych )
library( skimr )
library( Hmisc )
library( plyr )
library( ggpubr )

Companies are interested to know which customers are going to churn, so they can proactively approach those customers, provide them with better services, and turn their decision around. To answer these questions, as a practical example, we use the churn data set, which is available in the R package liver. This dataset comes from the IBM Sample Data Sets. It contains 5000 rows (customers) and 20 columns (features). The "churn" column is our target, which indicates whether the customer churned (left the company) or not. The 20 variables are:
- state: Categorical, for the 51 states and the District of Columbia.
- area.code: Categorical.
- account.length: Count, how long the account has been active.
- voice.plan: Categorical, yes or no, voice mail plan.
- voice.messages: Count, number of voice mail messages.
- intl.plan: Categorical, yes or no, international plan.
- intl.mins: Continuous, minutes customer used service to make international calls.
- intl.calls: Count, total number of international calls.
- intl.charge: Continuous, total international charge.
- day.mins: Continuous, minutes customer used service during the day.
- day.calls: Count, total number of calls during the day.
- day.charge: Continuous, total charge during the day.
- eve.mins: Continuous, minutes customer used service during the evening.
- eve.calls: Count, total number of calls during the evening.
- eve.charge: Continuous, total charge during the evening.
- night.mins: Continuous, minutes customer used service during the night.
- night.calls: Count, total number of calls during the night.
- night.charge: Continuous, total charge during the night.
- customer.calls: Count, number of calls to customer service.
- churn: Categorical, indicator of whether the customer has left the company (yes or no).

We import the dataset in R as follows:
data( churn ) # load the "churn" dataset

To see an overview of the dataset in R, we could use the following functions:
- str to see a compact display of the structure of the data.
- View to see spreadsheet-style data.
- head to see the first part of the data (the first 6 rows).
- summary to see the summary of each variable.

To see the overview of the dataset in R, we use the function str as follows:
str( churn ) # Compactly display the structure of the data

'data.frame': 5000 obs. of 20 variables:
$ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
$ area.code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
$ account.length: int 128 107 137 84 75 118 121 147 117 141 ...
$ voice.plan : Factor w/ 2 levels "yes","no": 1 1 2 2 2 2 1 2 2 1 ...
$ voice.messages: int 25 26 0 0 0 0 24 0 0 37 ...
$ intl.plan : Factor w/ 2 levels "yes","no": 2 2 2 1 1 1 2 1 2 1 ...
$ intl.mins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
$ intl.calls : int 3 3 5 7 3 6 7 6 4 5 ...
$ intl.charge : num 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
$ day.mins : num 265 162 243 299 167 ...
$ day.calls : int 110 123 114 71 113 98 88 79 97 84 ...
$ day.charge : num 45.1 27.5 41.4 50.9 28.3 ...
$ eve.mins : num 197.4 195.5 121.2 61.9 148.3 ...
$ eve.calls : int 99 103 110 88 122 101 108 94 80 111 ...
$ eve.charge : num 16.78 16.62 10.3 5.26 12.61 ...
$ night.mins : num 245 254 163 197 187 ...
$ night.calls : int 91 103 104 89 121 118 118 96 90 97 ...
$ night.charge : num 11.01 11.45 7.32 8.86 8.41 ...
$ customer.calls: int 1 1 0 2 3 0 3 0 1 0 ...
$ churn : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
It shows that the data are stored as a data.frame object in R, with 5000 observations and 20 variables. The last column (named churn) is the target variable that indicates whether customers churned (left the company) or not.
By using the function summary in R, we can see a summary of the dataset as follows:

summary( churn )

 state         area.code        account.length   voice.plan
WV : 158 area_code_408:1259 Min. : 1.0 yes:1323
MN : 125 area_code_415:2495 1st Qu.: 73.0 no :3677
AL : 124 area_code_510:1246 Median :100.0
ID : 119 Mean :100.3
VA : 118 3rd Qu.:127.0
OH : 116 Max. :243.0
(Other):4240
voice.messages intl.plan intl.mins intl.calls intl.charge
Min. : 0.000 yes: 473 Min. : 0.00 Min. : 0.000 Min. :0.000
1st Qu.: 0.000 no :4527 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
Median : 0.000 Median :10.30 Median : 4.000 Median :2.780
Mean : 7.755 Mean :10.26 Mean : 4.435 Mean :2.771
3rd Qu.:17.000 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
Max. :52.000 Max. :20.00 Max. :20.000 Max. :5.400
day.mins day.calls day.charge eve.mins eve.calls
Min. : 0.0 Min. : 0 Min. : 0.00 Min. : 0.0 Min. : 0.0
1st Qu.:143.7 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4 1st Qu.: 87.0
Median :180.1 Median :100 Median :30.62 Median :201.0 Median :100.0
Mean :180.3 Mean :100 Mean :30.65 Mean :200.6 Mean :100.2
3rd Qu.:216.2 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1 3rd Qu.:114.0
Max. :351.5 Max. :165 Max. :59.76 Max. :363.7 Max. :170.0
eve.charge night.mins night.calls night.charge
Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.000
1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00 1st Qu.: 7.510
Median :17.09 Median :200.4 Median :100.00 Median : 9.020
Mean :17.05 Mean :200.4 Mean : 99.92 Mean : 9.018
3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00 3rd Qu.:10.560
Max. :30.91 Max. :395.0 Max. :175.00 Max. :17.770
customer.calls churn
Min. :0.00 yes: 707
1st Qu.:1.00 no :4293
Median :1.00
Mean :1.57
3rd Qu.:2.00
Max. :9.00
It shows the summary of all the 20 variables.
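Since the skimr package was loaded earlier, skim() offers a compact alternative overview (per-variable counts, missing values, and small inline histograms); a sketch:

```r
library( skimr )

skim( churn ) # broad per-variable summary of the churn data
```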
a. For each variable in the churn dataset, specify its type.
| Variable | R Datatype | Statistical Datatype |
|---|---|---|
| state | Factor | categorical - nominal |
| area.code | Factor | categorical - nominal |
| account.length | int | numerical - discrete |
| voice.plan | Factor | categorical - binary |
| voice.messages | int | numerical - discrete |
| intl.plan | Factor | categorical - binary |
| intl.mins | num | numerical - continuous |
| intl.calls | int | numerical - discrete |
| intl.charge | num | numerical - continuous |
| day.mins | num | numerical - continuous |
| day.calls | int | numerical - discrete |
| day.charge | num | numerical - continuous |
| eve.mins | num | numerical - continuous |
| eve.calls | int | numerical - discrete |
| eve.charge | num | numerical - continuous |
| night.mins | num | numerical - continuous |
| night.calls | int | numerical - discrete |
| night.charge | num | numerical - continuous |
| customer.calls | int | numerical - discrete |
| churn | Factor | categorical - binary |
b. Based on the output of the summary function
for the churn dataset, what is the number of customers who have an
international plan (intl.plan = "yes")?
intl_plan_holder = length( which( churn $ intl.plan == "yes" ) )
intl_plan_holder

[1] 473

So the number of customers with an international plan is 473.
Here we report a bar plot for the target variable churn
by using function ggplot() from the R
package ggplot2 as follows:
ggplot( data = churn ) +
  geom_bar( aes( x = churn ), fill = c( "red", "blue" ) ) +
  labs( title = "Bar plot for the target variable 'churn'" )

Summary for the target variable churn:

summary( churn $ churn )

 yes   no 
 707 4293
Based on the above output, what is the proportion of churners (the customer churn rate)?
\(n(churn = no) = 4293\)
\(n(churn = yes) = 707\)
\(n = n(churn = no) + n(churn = yes) = 4293 + 707 = 5000\)
\(P(churn = yes) = \frac{n(churn = yes)}{n} = \frac{707}{5000} = 0.1414 = 14.14\%\)
The proportion of churners is \(14.14\%\).

Here we first report a contingency table of International Plan (intl.plan) with churn:
table( churn $ churn, churn $ intl.plan, dnn = c( "Churn", "International Plan" ) )

      International Plan
Churn  yes   no
  yes  199  508
  no   274 4019
Here is the above contingency table with margins:

addmargins( table( churn $ churn, churn $ intl.plan, dnn = c( "Churn", "International Plan" ) ) )

      International Plan
Churn  yes   no  Sum
  yes  199  508  707
  no   274 4019 4293
  Sum  473 4527 5000
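The contingency table above can be turned into per-category churn rates with prop.table(); a small sketch:

```r
# Churn rate within each international-plan category:
# margin = 2 normalizes each column (intl.plan level) to sum to 1
round( prop.table( table( churn $ churn, churn $ intl.plan ), margin = 2 ), 2 )
# approximately: 0.42 of plan holders churn vs 0.11 of non-holders
```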
Bar chart for International Plan
ggplot( data = churn ) +
  geom_bar( aes( x = intl.plan, fill = churn ) ) +
  scale_fill_manual( values = c( "red", "blue" ) )

ggplot( data = churn ) +
  geom_bar( aes( x = intl.plan, fill = churn ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

What would be your interpretation of the above plots?
The first plot depicts a bar chart of intl.plan with churn overlay. The chart shows that approximately 500 customers have an international plan, while the rest do not. The proportion of churners is not clearly readable, but roughly half of the customers with an international plan appear to churn, while the churn rate is much smaller for customers without an international plan.
The right chart is the standardised version of the bar chart on the
left. The standardised chart’s y-axis ranges between 0 and 1, and it
shows the proportions of churners in each intl.plan
category. This chart allows for more precision; we read that
approximately 42% of international plan holders change to a different
company, while this number is around 11% for those not having an
international plan.
Conclusively, the churn rate for international plan holders is about four times higher than for non-subscribers. The variable may carry explanatory power, so it would not come as a surprise if the data mining algorithm used intl.plan in the prediction model.
Make a table for counts of Churn and Voice Mail Plan
addmargins( table( churn $ churn, churn $ voice.plan, dnn = c( "Churn", "Voice Mail Plan" ) ) )

      Voice Mail Plan
Churn  yes   no  Sum
  yes  102  605  707
  no  1221 3072 4293
  Sum 1323 3677 5000
Bar chart for Voice Mail Plan
ggplot( data = churn ) +
  geom_bar( aes( x = voice.plan, fill = churn ) ) +
  scale_fill_manual( values = c( "red", "blue" ) )

ggplot( data = churn ) +
  geom_bar( aes( x = voice.plan, fill = churn ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

What would be your interpretation of the above plots?
The left plot depicts a bar chart for voice.plan with churn overlay, grouped by voice.plan categories. Voice mail plan holders account for about 1,300 customers, while non-holders account for about 3,700 customers.
The right plot is a standardised bar chart, and shows that approximately 8% of voice mail plan holders churn. For those who did not opt for voice mail plan, the churn rate is approximately twice as much, 16%.
Conclusively, the churn rate is higher among customers who do not have a voice mail plan than among those who do. voice.plan may help explain the overall 14.14% churn rate, and hence we expect the data mining algorithm to include it in the prediction model.
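The voice-plan churn rates quoted above can be verified numerically; a small sketch using tapply() to average the churn indicator within each voice.plan group:

```r
# Mean of the logical churn indicator = churn rate per voice.plan category
tapply( churn $ churn == "yes", churn $ voice.plan, mean )
# approximately 0.08 for plan holders and 0.16 for non-holders
```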
Here, we are interested to investigate the relationship between
variable “customer service calls” and the target variable
“churn”. First, we report the histogram of the variable
“customer service calls” by using function ggplot
as follows
ggplot( data = churn ) +
  geom_bar( aes( x = factor( customer.calls ) ) )

To see the relationship between the variable "customer service calls" and the target variable "churn", we report the bar chart of "customer service calls" with a "churn" overlay as follows:
ggplot( data = churn ) +
  geom_bar( aes( x = factor( customer.calls ), fill = churn ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

We also report the normalized bar chart of "customer service calls" with a "churn" overlay as follows:
ggplot( data = churn ) +
  geom_bar( aes( x = factor( customer.calls ), fill = churn ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

What would be your interpretation of the above plots?
The first plot is a regular bar chart of customer.calls, and it shows a positively skewed distribution with a mean of about 1.57 calls. The skewness may be explained by the fact that the majority of customers who contact the customer service desk have their issue resolved within one call, so there is no need for further calls.
The second plot is similar to the first, but it adds a churn overlay. The overlay shows, for each customer.calls bar, how many customers have versus have not churned. Since the proportions are challenging to read from this chart, we turn to the standardised chart, which is the last plot of this section.
The standardised chart depicts the proportions of churners versus non-churners for each customer-call count. We observe that the churn rate is nearly constant at around 11% for customers who made 3 or fewer customer service calls. The churn rate, however, at least quadruples for those contacting the customer service desk at least 4 times. It seems that customers' tolerance or satisfaction drops significantly as soon as they need to call customer service at least 4 times.
It is worth noting that the standardised and non-standardised plots must be used together, otherwise the observations may be misleading. For instance, had we not taken the non-standardised bar chart into account, we would have concluded that customers calling the service desk 9 times churn with certainty. However, for this observation the data is not representative because of the low sample size; i.e., there are only a few customers with 9 customer service calls.
Given the strong graphical evidence of predictive importance of
customer.calls, it is expected that the data mining
algorithm will include customer.calls in the model.
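The churn rate per number of customer service calls, discussed above, can likewise be tabulated; a sketch:

```r
# Column-wise proportions: churn rate for each customer.calls count
round( prop.table( table( churn $ churn, churn $ customer.calls ), margin = 2 ), 2 )
```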
Here, we are interested to investigate the relationship between variable Day Minutes and the target variable Churn. First, we report the “Normalized” histogram of Day Minutes including Churn overlay:
ggplot( data = churn ) +
  geom_histogram( aes( x = day.mins, fill = churn ), position = "fill", binwidth = 25, color = "white" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

Another way to see the relationship between the variable Day Minutes and the target variable churn is to use a boxplot, as follows:
ggplot( data = churn ) +
  geom_boxplot( aes( x = churn, y = day.mins ), fill = c( "red", "blue" ) )

What would be your interpretation of the above boxplot?
The above boxplots depict the locality, spread, and skewness of day.mins grouped by churn categories.
Comparison of location: For both churn categories, churners and non-churners, the distribution is approximately symmetric, meaning the mean and median are approximately at the same location. This allows us to infer that churners use the service for roughly 215 minutes a day on average, while non-churners average around 190 minutes a day.
To allow for a more precise analysis, we can plot the data as a frequency density plot; the code below does that.

ggplot( data = churn ) +
  geom_density( aes( x = day.mins, fill = churn ), alpha = 0.3 )

We observe that churners show bimodality, at around 160 and 270 minutes; that is, customers with roughly 160 or 270 daily minutes of phone calls churn at the highest rate. Furthermore, we can read that the customer churn rate increases as day.mins exceeds 200 minutes.
Comparison of dispersion: From the boxplots we read that the interquartile range for churners is approximately twice as large as for non-churners. Additionally, we read from the span of the whiskers that the range of day.mins for churners is wider than for non-churners.
Comparison of skewness: day.mins for churners is skewed to the left; churners spend more time on average on phone calls than non-churners.
Comparison of potential outliers: For churners, the whiskers of day.mins span from the minimum of the data to the maximum, so there are no outliers or unusual values. The same cannot be said for non-churners, whose boxplot reports potential outliers beyond both the lower and upper whiskers.
day.mins seems to hold relevant information, and therefore we should expect the data mining algorithm to select this variable into the model.
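The group-wise day.mins figures discussed above can be checked numerically; a small sketch using tapply():

```r
# Five-number summary (plus mean) of day.mins per churn group
tapply( churn $ day.mins, churn $ churn, summary )
```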
Here, we are interested to investigate the relationship between
variable International Calls and the target variable churn.
First, we report the histogram of the variable International Calls as
follows:
ggplot( data = churn ) +
  geom_bar( aes( x = intl.calls ) )

To see the relationship between the variable International Calls and the target variable churn, we report the bar chart of International Calls with a Churn overlay as follows:
ggplot( data = churn ) +
  geom_bar( aes( x = intl.calls, fill = churn ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

We also report the normalized bar chart of International Calls with a Churn overlay as follows:
ggplot( data = churn ) +
  geom_bar( aes( x = intl.calls, fill = churn ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

To see the relationship between the variable International Calls and the target variable churn, we also report the boxplot as follows:
ggplot( data = churn ) +
  geom_boxplot( aes( x = churn, y = intl.calls ), fill = c( "red", "blue" ) )

What would be your interpretation of the above boxplot?
The above boxplots depict the locality, spread and skewness of
intl.calls grouped by churn categories.
Comparison of location: Based on the boxplots, we see that the median count of international calls made by churners and non-churners is approximately the same.
Comparison of dispersion: The interquartile ranges for churners and non-churners are reasonably similar, as is the overall range of the data, seen by comparing the whisker lengths.
Comparison of skewness: Both groups are positively skewed, more so for churners than for non-churners, meaning churners tend to make fewer international calls than non-churners.
Comparison of potential outliers: Both groups contain outliers beyond the upper whisker; however, no definitive conclusion can be drawn from this observation.
As a general conclusion, the above plots do not show strong graphical evidence of the predictive importance of international calls. Therefore, the data mining algorithm may not include this variable in the model.
In this part, we want to use Exploratory Data Analysis to explore the bank dataset that is available in the R package liver. You can find more information about the bank dataset at the following link on pages 4-5: the manual of the liver package; or here.
Find the best strategies to improve the next marketing campaign. How can the financial institution achieve greater effectiveness in future marketing campaigns? To make a data-driven decision, we need to analyze the last marketing campaign the bank performed and identify the patterns that will help us draw conclusions for developing future strategies.
Two main approaches for enterprises to promote products/services are mass campaigns, targeting the general public, and directed marketing, targeting a specific set of contacts.
In general, positive responses to mass campaigns are typically very low (less than 1%). On the other hand, direct marketing focuses on targets that are keener on that specific product/service, making this kind of campaign more effective. However, direct marketing has some drawbacks; for instance, it may trigger a negative attitude towards banks due to the intrusion of privacy.
Banks are interested in increasing financial assets. One strategy is to offer attractive long-term deposit applications with good interest rates, in particular by using directed marketing campaigns. At the same time, there is pressure to reduce costs and time. Thus, there is a need for improved efficiency: fewer contacts should be made, while keeping approximately the same number of successes (clients subscribing to the deposit).
A Term Deposit is a deposit that a bank or a financial institution offers with a fixed rate (often better than just opening a deposit account), in which your money will be returned at a specific maturity time. For more information with regards to Term Deposits please check here.
The bank dataset is related to direct marketing campaigns of a Portuguese banking institution. You can find more information related to this dataset at: https://rdrr.io/cran/liver/man/bank.html
The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed or not. The classification goal is to predict whether the client will subscribe to a term deposit (variable deposit).
We import the bank dataset:
data( bank )

We can see the structure of the dataset by using the str function:
str( bank )

'data.frame': 4521 obs. of 17 variables:
$ age : int 30 33 35 30 59 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
$ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
$ day : int 19 11 16 3 5 23 14 6 14 17 ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
$ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
$ previous : int 0 4 1 0 0 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ deposit : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
It shows that the bank dataset is a data.frame with 17 variables and 4521 observations. The dataset has 16 predictors along with the target variable deposit, which is a binary variable with two levels, "yes" and "no". The variables in this dataset are:
- age: numeric.
- job: type of job; categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services".
- marital: marital status; categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed.
- education: categorical: "secondary", "primary", "tertiary", "unknown".
- default: has credit in default?; binary: "yes", "no".
- balance: average yearly balance, in euros; numeric.
- housing: has housing loan?; binary: "yes", "no".
- loan: has personal loan?; binary: "yes", "no".

Related to the last contact of the current campaign:

- contact: contact communication type; categorical: "unknown", "telephone", "cellular".
- day: last contact day of the month; numeric.
- month: last contact month of year; categorical: "jan", "feb", "mar", ..., "nov", "dec".
- duration: last contact duration, in seconds; numeric.

Other attributes:

- campaign: number of contacts performed during this campaign and for this client; numeric, includes last contact.
- pdays: number of days that passed by after the client was last contacted from a previous campaign; numeric, -1 means the client was not previously contacted.
- previous: number of contacts performed before this campaign and for this client; numeric.
- poutcome: outcome of the previous marketing campaign; categorical: "success", "failure", "unknown", "other".

Target variable:

- deposit: indicator of whether the client subscribed to a term deposit; binary: "yes" or "no".

Following Part 1, first report the summary of the dataset, then apply Exploratory Data Analysis.
bank dataset
The code below reports descriptive statistics of the bank dataset.

summary( bank )

 age            job            marital        education     default
Min. :19.00 management :969 divorced: 528 primary : 678 no :4445
1st Qu.:33.00 blue-collar:946 married :2797 secondary:2306 yes: 76
Median :39.00 technician :768 single :1196 tertiary :1350
Mean :41.17 admin. :478 unknown : 187
3rd Qu.:49.00 services :417
Max. :87.00 retired :230
(Other) :713
balance housing loan contact day
Min. :-3313 no :1962 no :3830 cellular :2896 Min. : 1.00
1st Qu.: 69 yes:2559 yes: 691 telephone: 301 1st Qu.: 9.00
Median : 444 unknown :1324 Median :16.00
Mean : 1423 Mean :15.92
3rd Qu.: 1480 3rd Qu.:21.00
Max. :71188 Max. :31.00
month duration campaign pdays
may :1398 Min. : 4 Min. : 1.000 Min. : -1.00
jul : 706 1st Qu.: 104 1st Qu.: 1.000 1st Qu.: -1.00
aug : 633 Median : 185 Median : 2.000 Median : -1.00
jun : 531 Mean : 264 Mean : 2.794 Mean : 39.77
nov : 389 3rd Qu.: 329 3rd Qu.: 3.000 3rd Qu.: -1.00
apr : 293 Max. :3025 Max. :50.000 Max. :871.00
(Other): 571
previous poutcome deposit
Min. : 0.0000 failure: 490 no :4000
1st Qu.: 0.0000 other : 197 yes: 521
Median : 0.0000 success: 129
Mean : 0.5426 unknown:3705
3rd Qu.: 0.0000
Max. :25.0000
deposit
The variable deposit is the target variable, an indicator of whether the client subscribed to a term deposit or not. First, deposit is summarised in a table, and then plotted as a bar chart.
addmargins( table( bank $ deposit, dnn = c( "Deposit" ) ) )

Deposit
  no  yes  Sum
4000  521 4521
ggplot( data = bank ) +
  geom_bar( aes( x = deposit ), fill = c( "red", "blue" ) ) +
  labs( title = "Bar plot for the target variable 'deposit'" )

The table and bar chart above report that a total of 4521 customers were contacted, which is the sum of the heights of the bars. Of the contacted customers, 521 subscribed and 4000 did not. This comes down to an 11.52% campaign success rate.
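The 11.52% success rate can be computed directly; a small sketch:

```r
# Campaign success rate: share of contacted clients who subscribed
mean( bank $ deposit == "yes" )
# approximately 0.1152 (521 / 4521)
```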
age
age is a discrete numerical variable, representing the age of the customer in the dataset. Below, age is plotted as a bar chart so that we can observe the age distribution of those the campaign targeted.
ggplot( data = bank ) +
geom_bar( aes( x = factor( age ) ) ) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
We can report that the target group of the campaign was that of age
between 23 and 60. However, we notice that the bank targeted the age
range of 30 to 40 the most.
Next, we plot age as a regular bar chart, with
deposit overlay, and as a standardised bar chart with the
same overlay.
ggplot( data = bank ) +
  geom_bar( aes( x = age, fill = deposit ) ) +
  scale_fill_manual( values = c( "red", "blue" ) )

ggplot( data = bank ) +
  geom_bar( aes( x = age, fill = deposit ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) )

From the above plots we read that the target group, specifically those between the ages of 23 and 60, produced very similar outcomes. Even the mainly targeted age group of 30 to 40 did not stand out with a high subscription rate: on average the subscription rate is approximately 10%, only about one percentage point below the overall subscription rate of the campaign.
Outside the target group the campaign was more successful: under the age of 23 and beyond 60, the campaign shows at least a 37.5% success rate on average. In other words, customers younger than 23 or older than 60 are more likely to subscribe.
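The claim about age bands can be checked numerically; a sketch using hypothetical cut points at 23 and 60 (band edges read from the plots):

```r
# Subscription rate inside vs outside the 23-60 target range
age_band <- cut( bank $ age, breaks = c( 18, 22, 60, 90 ), include.lowest = TRUE )
tapply( bank $ deposit == "yes", age_band, mean )
```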
Next we plot the data as a boxplot, grouped by
deposit.
ggplot( data = bank ) +
  geom_boxplot( aes( x = deposit, y = age ), fill = c( "red", "blue" ) )

Comparison of location: The central locations of the two boxplots are approximately shared at 40 years of age, showing that the median age of depositors and non-depositors is approximately the same.
Comparison of dispersion: The interquartile range of non-depositors is smaller than that of depositors, and this also holds for the overall age range. That is, the age range of non-depositors is narrower than that of depositors.
Comparison of skewness: For both groups, the data is positively skewed.
Comparison of potential outliers: Both categories report some outliers; however, non-depositors have about twice as many outliers as depositors.
Based on the above two graphs, we see strong graphical indication of age's predictive importance. Therefore, we expect the data mining algorithm to include this variable in the model.
job
job is a categorical variable and describes the type of
job the customers have at the time of the campaign.
We first plot job categories as a bar chart with
deposit overlay, so that we can identify the specific
target groups per occupation, and see their corresponding deposit
rate.
ggplot( data = bank ) +
  geom_bar( aes( x = factor( job ), fill = deposit ), position = "stack" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) +
  theme( axis.text.x = element_text( angle = 90, vjust = 0.5, hjust = 1 ) )

From the above bar chart we can distinguish three occupations that stand out in frequency, each with a total count greater than 750: blue-collar workers, management personnel, and technicians. The second-largest group, with more than 375 counts each, consists of administrative workers and those working in the service sector. The third group covers every other occupation, each represented by fewer than 250 people.
We standardise the bar chart above and replot it below.
ggplot( data = bank ) +
  geom_bar( aes( x = factor( job ), fill = deposit ), position = "fill" ) +
  scale_fill_manual( values = c( "red", "blue" ) ) +
  theme( axis.text.x = element_text( angle = 90, vjust = 0.5, hjust = 1 ) )

From the standardised plot we read that the campaign was most effective among the following job categories: retired, students, and those with unknown occupation. This finding coincides with the previous variable, age, where we saw that the subscription rate was high below age 23 and beyond age 60, which tends to correspond to students and the elderly.
We further note that the three most-targeted job groups do not necessarily produce a favourable outcome for the campaign: the subscription rate is low for blue-collar workers, at around 5%, and is 12.5% for management personnel and around 11% for those working in services. The latter two job categories do not deviate significantly from the overall success rate of the campaign.
For the rest of the occupations, we see deposit rates ranging between 6% and 12%. Since we deemed age an important predictor, we claim that there is strong graphical indication for job being a relevant explanatory variable for the target variable deposit. Therefore, we expect job to be included in the data mining model too.
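The per-category subscription rates read off the standardised chart can also be computed numerically with prop.table(). A minimal sketch on a hypothetical toy sample (in the actual analysis the inputs would be bank$job and bank$deposit):

```r
# Hypothetical toy sample standing in for bank$job / bank$deposit
job     <- c( "student", "student", "blue-collar", "blue-collar", "blue-collar", "retired" )
deposit <- c( "yes", "no", "no", "no", "yes", "yes" )

# Row-wise proportions: the subscription rate within each job category,
# i.e. exactly what the standardised (position = "fill") bar chart shows
rates <- prop.table( table( job, deposit ), margin = 1 )
print( round( rates, 2 ) )
```

Each row of `rates` sums to 1, so the "yes" column is the per-occupation deposit rate.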
marital
The variable marital stands for the marital status of the customer. The attribute can take one of the following statuses: "married", "single", or "divorced", where "divorced" means divorced or widowed.
We first plot marital status with a deposit overlay, and then standardised in a bar chart, to observe whether there are any noticeable patterns.
ggplot( data = bank ) +
geom_bar( aes( x = marital, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
ggplot( data = bank ) +
geom_bar( aes( x = marital, fill = deposit ), position = "fill" ) +
scale_fill_manual( values = c( "red", "blue" ) )
From the above plots we see that customers with married status are over-represented in the dataset, with a count of approximately 2800. The dataset contains approximately 1200 singles, while the remaining roughly 500 customers are either divorced or widowed.
The standardised plot allows us to read the deposit proportions among marital statuses: the success rate of the campaign across the different marital statuses is similar, ranging between 10% and 12%. We further interpret that married customers are less keen on subscribing than those with either of the other marital statuses.
Since the variable marital does not show strong graphical indication of predictive importance, it is likely that the data mining algorithm will not include it in the model.
education
The education variable has four categories: "primary", "secondary", "tertiary", and "unknown".
We explore education by means of a regular bar chart and a standardised bar chart, both with a deposit overlay.
ggplot( data = bank ) +
geom_bar( aes( x = education, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
ggplot( data = bank ) +
geom_bar( aes( x = education, fill = deposit ), position = "fill" ) +
scale_fill_manual( values = c( "red", "blue" ) )
The above bar charts show that customers with secondary education are represented the most in the dataset, followed by those with tertiary education and those with primary education.
Upon analysing the standardised chart, we report that those with tertiary education tend to subscribe with higher likelihood than customers with a different educational level. Additionally, we notice that the success rate of the campaign is almost identical for those with primary, secondary, and unknown education.
Since the graphical evidence is not strong, it is difficult to conclude whether or not the variable education is significant enough to include in the model.
default
The default variable is a binary attribute that stands
for whether the customer has defaulted on his/her credit or not.
We explore default through bar charts: first as a regular chart with a deposit overlay, and then as a standardised bar chart.
ggplot( data = bank ) +
geom_bar( aes( x = default, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
ggplot( data = bank ) +
geom_bar( aes( x = default, fill = deposit ), position = "fill" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We read from the regular bar chart that those who did not default on their credit are heavily over-represented in the dataset, with 4445 customers, while only 76 have defaulted. This is understandable from the standpoint of the bank: they target those with a good credit history, since those customers tend to be more reliable and have the necessary financial means to subscribe for the deposit.
The standardised chart reveals that there is no difference in the deposit rate between the two groups; therefore, the EDA of default is inconclusive. That is, default does not show strong graphical evidence of any predictive importance, so it is likely that the data mining algorithm will not include it in the prediction model.
balance
The balance attribute represents the customer's average annual account balance.
ggplot( bank ) +
geom_histogram( mapping = aes( x = balance ) )
One's balance is closely related to their income. Since the general consensus is that income distributions are positively skewed, we expect balance to be similarly shaped.
From the above histogram we can confirm that balance is indeed positively skewed, with a mean balance of €1423. The wide range of balances suggests that the dataset contains unusually high balances, i.e., chances are that there are outliers in the dataset. Furthermore, we note that balance may also be negative, representing an overdraft, which occurs when money is withdrawn in excess of what is in a current account.
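The expectation that income-like data is positively skewed can be illustrated with a simulated log-normal sample (simulated data only, not the bank data): the long right tail pulls the mean above the median, which is a simple diagnostic for positive skew.

```r
set.seed( 1 )
x <- rlnorm( 100000, meanlog = 0, sdlog = 1 )  # simulated income-like data

# In a positively skewed distribution the mean exceeds the median
mean( x ) > median( x )   # TRUE
summary( x )
```

The same mean-versus-median comparison on bank$balance would point in the same direction.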
Balances around 0 dominate the data; to see the rest of the distribution in more detail, we zoom in by adjusting the limit of the y-axis to [0, 1000].
ggplot( bank ) +
geom_histogram( mapping = aes( x = balance )) +
coord_cartesian( ylim = c( 0, 1000 ) )
Unfortunately, the zoomed-in view does not necessarily provide more information, but it confirms the existence of outliers in the range between €40000 and €70000. To see the potential outliers, we plot balance as a boxplot.
ggplot( data = bank ) +
geom_boxplot( aes( x = balance ) )
We identify the outliers, represented by the dots in the above plot. To draw any meaningful interpretation, we need to handle the outliers by means of imputation, and then replot the graph.
q1 = boxplot(bank $ balance)$stats[2, ]
q3 = boxplot(bank $ balance)$stats[4, ]
iqr = q3 - q1
whisker_lower = q1 - 1.5 * iqr
whisker_upper = q3 + 1.5 * iqr
bank = mutate( bank, balance = ifelse( balance < whisker_lower | balance > whisker_upper, NA, balance ) )
bank $ balance = impute( bank $ balance, 'random' )
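The fence computation and the random imputation above can be sketched in base R as well: quantile() approximates the boxplot hinges, and random imputation replaces each NA by a draw from the observed values. This is a simplified stand-in for Hmisc::impute(..., 'random'), shown on simulated data rather than bank$balance:

```r
set.seed( 42 )
# Simulated balances: mostly moderate values plus a few extreme outliers
balance <- c( rnorm( 995, mean = 1400, sd = 800 ), 50000, 60000, 65000, -3000, 70000 )

q1  <- quantile( balance, 0.25 )
q3  <- quantile( balance, 0.75 )
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Flag values beyond the whiskers as missing
balance[ balance < lower | balance > upper ] <- NA

# Random imputation: draw replacements from the observed (non-missing) values
observed <- balance[ !is.na( balance ) ]
balance[ is.na( balance ) ] <- sample( observed, sum( is.na( balance ) ), replace = TRUE )

sum( is.na( balance ) )   # no missing values remain
range( balance )          # all values now lie within the whisker fences
```

Because every imputed value is drawn from the non-outlying observations, the cleaned vector is guaranteed to sit inside the fences; note that this preserves the shape of the central distribution but discards the information that the outliers existed.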
ggplot( data = bank ) +
geom_histogram( aes( x = balance, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
The plot of the imputed data shows symmetry. We also read from the chart that the count for deposit follows the shape of the overall histogram, from which we infer that the deposit rate across the various balances is approximately constant. To confirm this, we plot it as a standardised histogram.
ggplot( data = bank) +
geom_histogram( aes( x = balance, fill = deposit ), position = "fill") +
scale_fill_manual( values = c( "red", "blue" ) )
The above expectation is confirmed. The subscription rate of the campaign is between 9% and 12% and is approximately constant over the different account balances. There is thus no strong graphical evidence that balance is an important factor in determining deposit, so we do not expect the data mining algorithm to incorporate the balance variable in the model.
housing
Given that housing is a binary variable that stands for whether the person has a mortgage, we first make a contingency table with the target variable, then plot housing as a regular bar chart, and finally standardise it with a deposit overlay.
addmargins( table( bank $ deposit, bank $ housing, dnn = c( "Deposit", "Housing" ) ) )
       Housing
Deposit   no  yes  Sum
    no  1661 2339 4000
    yes  301  220  521
    Sum 1962 2559 4521
ggplot( data = bank ) +
geom_bar( aes( x = housing, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
Both the contingency table and the bar chart show that those with a housing loan represent the majority of customers. To be able to judge the proportions between loan takers and deposit subscribers, we move on to the standardised bar chart.
ggplot( data = bank ) +
geom_bar( aes( x = housing, fill = deposit ), position = "fill" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We see that approximately 15% of those without a mortgage subscribe for the deposit deal, while only 8% of mortgage holders subscribe. The difference between subscribers with and without a housing loan is approximately two-fold, which constitutes a substantial difference. Therefore, it is reasonable to assume that the data mining algorithm will incorporate housing into the model.
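The 15% versus 8% reading can be checked directly from the counts in the contingency table printed above; the ratio comes out to roughly 1.8, close to the two-fold reading from the chart.

```r
# Counts copied from the Deposit x Housing table above
yes_no_housing <- 301;  total_no_housing <- 1962   # subscribers / total, no mortgage
yes_housing    <- 220;  total_housing    <- 2559   # subscribers / total, with mortgage

rate_no_housing <- yes_no_housing / total_no_housing   # subscription rate without a mortgage
rate_housing    <- yes_housing / total_housing         # subscription rate with a mortgage

round( c( no = rate_no_housing, yes = rate_housing ), 3 )
rate_no_housing / rate_housing   # roughly 1.8-fold difference
```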
loan
Similar to housing, loan is a binary
variable as well, therefore we can follow the exact same analysis that
we did for housing.
That is, first we report a contingency table, and then plot the data on a bar chart.
addmargins( table( bank $ deposit, bank $ loan, dnn = c( "Deposit", "Loan" ) ) )
       Loan
Deposit   no  yes  Sum
    no  3352  648 4000
    yes  478   43  521
    Sum 3830  691 4521
ggplot( data = bank ) +
geom_bar( aes( x = loan, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
From the above chart we see that approximately 3800 customers do not have a personal loan, which represents a significant majority, while the remaining roughly 700 do have one.
The proportions are hard to read from the regular bar chart; therefore, we look at the standardised equivalent of the above chart.
ggplot( data = bank ) +
geom_bar( aes( x = loan, fill = deposit ), position = "fill" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We read that the success rate of the campaign among customers without a personal loan is about 12.5%, while it is only 6% for those with a personal loan. In this case the difference between the success rates of the two groups is more than two-fold; hence, we expect the data mining algorithm to include loan in the model.
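The "more than two-fold" claim can be verified from the counts in the contingency table printed above:

```r
# Counts copied from the Deposit x Loan table above
rate_no_loan <- 478 / 3830   # subscription rate without a personal loan, ~12.5%
rate_loan    <- 43 / 691     # subscription rate with a personal loan, ~6%

round( c( no = rate_no_loan, yes = rate_loan ), 3 )
rate_no_loan / rate_loan     # just over 2-fold
```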
contact
The contact variable describes the channel through which the customer was contacted. There are 3 categories, which we interpret as follows: cellular refers to the customer's wireless mobile phone, telephone to a landline phone, and the third category, unknown, may cover indirect ways such as brochures, billboards, etc.
Since it is a categorical variable, we can plot it on a bar chart.
ggplot( data = bank ) +
geom_bar( aes( x = contact, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We see that the majority of customers were reached either on their cellular phone or in an unknown way.
ggplot( data = bank ) +
geom_bar( aes( x = contact, fill = deposit ), position = "fill" ) +
scale_fill_manual( values = c( "red", "blue" ) )
The standardised chart shows that there is practically no difference between the success rates of cellular and telephone contacts; however, the campaign was not successful in the unknown group.
Cellular and telephone performed above the overall campaign success rate; therefore, the campaign team may be advised to reallocate resources from the unknown contact channel to cellular or telephone to elevate the subscription rate.
The above plot indicates strong graphical evidence of the predictive
importance of the device, therefore, the data mining algorithm may as
well select contact into the model.
day
The day variable represents the day of the month on which the customer was last contacted. When plotted as a bar chart, we see the last contact day's frequency.
ggplot( data = bank ) +
geom_bar( aes( x = day, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We see that contacts are more frequent in the first two-thirds of the month than in the last third. We standardise the plot to read the subscription-rate proportions precisely.
ggplot( data = bank) +
geom_bar(aes( x = day, fill = deposit ), position = "fill" ) +
scale_fill_manual( values = c( "red", "blue" ) )
From the standardised plot, we see little to no pattern that would suggest graphical evidence that day has predictive importance for deposit. There are a few days that stand out in terms of success, but that may very well be just random noise.
month
The month variable indicates the last contact point in
terms of month. First we plot it as a regular bar chart to identify the
number of contact points made per month.
ggplot( data = bank) +
geom_bar(aes( x = month, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We see that there are 5 months that stand out significantly in terms of the number of contacts: May, June, July, August, and November. These are the months in which the campaign was the most aggressive and reached out to the most contacts. Next we standardise the chart to see the success rate of the campaign per month.
ggplot( data = bank) +
geom_bar(aes( x = month, fill = deposit ), position = "fill" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We see from the above plot that March, September, October, and December were the most successful months, with more than 40% success rate on average. February and April follow with somewhat lower subscription rates of 15% to 20%. The above plot thus indicates strong graphical evidence of the predictive importance of month.
duration
The duration variable is the duration of the last contact, expressed in seconds. We plot the data as a histogram with a bin size of 60 seconds, so that each bar in the chart represents a minute of duration.
ggplot( bank ) +
geom_histogram( mapping = aes( x = duration ), binwidth = 60 )
The wide range of the x-axis suggests that the data has some outliers. Furthermore, the distribution of duration is positively skewed, which is expected, since most customers tend to keep promotional/marketing calls as short as possible.
Next we query some descriptive statistics on the variable in question.
summary( bank $ duration )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      4     104     185     264     329    3025
We report that the mean duration of the phone calls was 264 seconds. Furthermore, 25% of customers kept the phone call under 104 seconds, 50% under 185 seconds, and 75% just under 329 seconds.
To identify the outliers in the data, we plot it as a boxplot grouped by deposit.
ggplot( data = bank ) +
geom_boxplot( aes( y = duration ) )
The above boxplot identifies outliers beyond 666.5 seconds: any observation beyond that duration is deemed an outlier. Next, to deal with the outliers, we replace them with NA, impute the data, and finally re-plot the graph grouped by deposit.
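The 666.5-second cutoff read off the boxplot agrees with the Tukey fence computed from the summary statistics above (Q1 = 104, Q3 = 329):

```r
q1 <- 104; q3 <- 329          # quartiles from summary( bank $ duration )
iqr <- q3 - q1                # interquartile range: 225 seconds
upper_whisker <- q3 + 1.5 * iqr
upper_whisker                 # 666.5 seconds, matching the boxplot
```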
q1 = boxplot(bank $ duration)$stats[2, ]
q3 = boxplot(bank $ duration)$stats[4, ]
iqr = q3 - q1
whisker_lower = q1 - 1.5 * iqr
whisker_upper = q3 + 1.5 * iqr
bank = mutate( bank, duration = ifelse( duration < whisker_lower | duration > whisker_upper, NA, duration ) )
bank $ duration = impute( bank $ duration, 'random' )
ggplot( data = bank) +
geom_histogram( mapping = aes( x = duration, fill = deposit) , binwidth = 60 ) +
scale_fill_manual( values = c( "red", "blue" ) )
It is worth noting that the imputed data kept its positively skewed shape; however, the number of successful contacts per minute follows a symmetric distribution, with a peak at approximately 220 seconds.
Next we standardise the above histogram.
ggplot( data = bank) +
geom_histogram( mapping = aes( x = duration, fill = deposit) , position = "fill", binwidth = 60 ) +
scale_fill_manual( values = c( "red", "blue" ) )
From the standardised histogram we clearly identify that the longer the duration of the call, the higher the proportion of subscribers for deposit.
Therefore, there is a strong graphical evidence of the predictive
importance of duration, so we expect the data mining
algorithm to incorporate this variable into the model.
campaign
The campaign variable stands for the number of contacts performed with a specific customer during the campaign.
First, we plot the data as a bar chart.
ggplot( data = bank) +
geom_bar(aes( x = campaign, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We identify that the number of times customers were contacted is heavily positively skewed. That is, most of the targeted people were contacted only a few times. Furthermore, based on the width of the x-axis, we see a clear indication of outliers. To deal with them, we first plot the data on a boxplot to confirm the existence of outliers, and only then mutate and impute the data and re-plot it.
ggplot( data = bank ) +
geom_boxplot( aes( x = campaign ) )
The above boxplot confirms the existence of outliers, represented by the dots beyond the right whisker.
q1 = boxplot(bank $ campaign)$stats[2, ]
q3 = boxplot(bank $ campaign)$stats[4, ]
iqr = q3 - q1
whisker_lower = q1 - 1.5 * iqr
whisker_upper = q3 + 1.5 * iqr
bank = mutate( bank, campaign = ifelse( campaign < whisker_lower | campaign > whisker_upper, NA, campaign ) )
bank $ campaign = impute( bank $ campaign, 'random' )
ggplot( data = bank) +
geom_bar(aes( x = campaign, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We report that observations beyond 6 contact points are eliminated, yet the distribution kept its original skewed shape. That is, after the imputation, the data still shows that the majority of customers were reached out to only a few times.
Next we standardise the above distribution to report the subscription rate per the number of contact points in the campaign.
ggplot( data = bank) +
geom_bar(aes( x = campaign, fill = deposit ), position = "fill" ) +
scale_fill_manual( values = c( "red", "blue" ) )
The above plot shows that the subscription rate is approximately constant over the first 4 contact points, at around 10% on average. To be more precise, the first contact point yields a bit over 12.5% success rate, which drops to around 10% for the second and third calls, and picks up again to 12.5% for the fourth call.
The major drop comes after the fifth contact point, where the success rate is consistently below 10%.
Although the above graphs offer some insights, the plot does not indicate strong graphical evidence of predictive importance of campaign.
pdays
pdays stands for the number of days that have passed since the customer was last contacted. The field may take the value -1, which means that the customer was not contacted previously and the current campaign contact is the first.
First, we report the frequency of each day passed as a bar chart.
ggplot( data = bank) +
geom_bar(aes( x = pdays, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We see not only that the values along the x-axis spread over a wide range, but also that there is a significant overweight of -1 observations, indicating that the majority of customers were not contacted previously.
It is advisable to filter out those contacts who were not reached out to earlier; otherwise we are unable to draw a meaningful conclusion on whether pdays is deterministic for the outcome of deposit.
We achieve this by plotting pdays only for those who
were reached out in the previous campaign.
ggplot( data = bank) +
geom_bar( data = subset( bank, pdays != -1 ), aes( x = pdays, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
Next, we plot a boxplot for the same observations, to see the outliers.
ggplot( data = bank ) +
geom_boxplot( data = subset( bank, pdays != -1 ), aes( y = pdays ) )
The boxplot identifies some outliers, which we first replace by NA values and then impute, and finally re-plot as a bar chart.
bank2 = mutate( bank, pdays = ifelse( pdays == -1, NA, pdays ) )
q1 = boxplot(bank2 $ pdays)$stats[2, ]
q3 = boxplot(bank2 $ pdays)$stats[4, ]
iqr = q3 - q1
whisker_lower = q1 - 1.5 * iqr
whisker_upper = q3 + 1.5 * iqr
bank2 = mutate( bank2, pdays = ifelse( pdays < whisker_lower | pdays > whisker_upper, NA, pdays ) )
bank2 $ pdays = impute( bank2 $ pdays, 'random' )
ggplot( data = bank2) +
geom_bar( aes( x = pdays, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
From the imputed data we see that the majority of customers who were contacted previously were contacted again for a follow-up call between 35 and 375 days later.
To see the success rate for each value of pdays, we plot it as a bar chart grouped by deposit.
ggplot( data = bank2) +
geom_bar(aes( x = pdays, fill = deposit ), position = "fill" ) +
scale_fill_manual( values = c( "red", "blue" ) )
Based on the above standardised graph, we see no strong graphical evidence of the predictive importance of pdays. Therefore, we do not expect the data mining algorithm to include pdays in the model.
previous
The variable previous stands for the number of contacts
performed before the current campaign. First we plot it as a bar
chart.
ggplot( data = bank) +
geom_bar(aes( x = previous, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We see that the majority of the customers were not contacted previously. This observation coincides with the findings for pdays, where these observations were denoted by the value -1.
Next, we standardise the data.
ggplot( data = bank) +
geom_bar(aes( x = previous, fill = deposit ), position = "fill" ) +
scale_fill_manual( values = c( "red", "blue" ) )
From the above plot we read that the more often a customer was contacted in the previous campaign (up to 10 times), the higher, on average, the likelihood of subscribing in the current campaign. Although there are some valleys in the deposit rate for previous within the range [0, 10], the overall trend is upward sloping.
In this case, we find strong enough graphical evidence that previous may serve as a good predictor for deposit, and hence expect the data mining algorithm to incorporate previous in the model.
poutcome
poutcome represents the outcome of the previous campaign, and can take one of the following 4 values: "failure", "success", "unknown", "other".
We first plot it as a bar chart.
ggplot( data = bank) +
geom_bar(aes( x = poutcome, fill = deposit ), position = "stack" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We see that the unknown category has the highest count, which again likely represents the customers who were not contacted previously, and thereby coincides with the data of pdays and previous.
Next, we plot poutcome as a standardised bar chart.
ggplot( data = bank) +
geom_bar(aes( x = poutcome, fill = deposit ), position = "fill" ) +
scale_fill_manual( values = c( "red", "blue" ) )
We interpret the above graph as follows: those who previously subscribed for the deposit are more likely to subscribe again in the current campaign. The above plot indicates strong graphical evidence of the predictive importance of poutcome, so the data mining algorithm will likely include poutcome in the model.
In the above exploratory data analysis we analysed 17 variables, out of which we found the following ones graphically indicative enough to consider as relevant predictors for deposit.
age: Customers under age of 23 or above 60 are more
likely to subscribe.
job: Customers with unknown, student or retired
occupation are the most likely to deposit. Next to that, customers with
administrative, housemaid, management, or technician occupation are the
second most likely to deposit.
education: Customers with a tertiary educational
background are the most likely to subscribe for the deposit
deal.
housing: Customers without a mortgage tend to subscribe about twice as often as those with a mortgage.
loan: Customers without a personal loan tend to subscribe about twice as often as those with a personal loan.
contact: Customers reached via cellular or telephone tend to subscribe more than those reached by unknown means.
month: Customers that were contacted last in March,
September, October or December are the most likely to respond positively
to the campaign and subscribe.
duration: The longer the duration of the phone call, the more likely the customer subscribes.
previous: The more times the customer was contacted
in the previous campaign (but less than 10 times), the more likely
he/she will subscribe in the current campaign.
poutcome: If the customer subscribed for the deal in the previous campaign, it is likely that he/she will do it again in the current campaign.
In this part, you can apply Exploratory Data Analysis to explore your own dataset, following the same steps as in Part 1 (above) of these exercises.